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[57] ABSTRACT 

Apparatus and methods for near real-time stereo vision 
system for use with a robotic vehicle comprises two 
cameras mounted on three-axis rotation platforms, im- 
age-processing boards, a CPU, and specialized stereo 
vision algorithms. Bandpass-filtered image pyramids are 
computed, stereo matching is performed by least- 
squares correlation, and confidence ranges are esti- 
mated by means of Bayes* theorem. In particular, Lapla- 
cian image pyramids are built and disparity maps are 
produced from the 60 x 64 level of the pyramids at rates 
of up to 2 seconds per image pair. The first autonomous 
cross-country robotic traverses (of up to 100 meters) 
have been achieved using the stereo vision system of the 
present invention with all computing due aboard the 
vehicle. The overall approach disclosed herein provides 
a unifying paradigm for practical domain-independent 
stereo ranging. 

16 Claims, 5 Drawing Sheets 
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NEAR REAL-TIME STEREO VISION SYSTEM 


ORIGIN OF THE INVENTION 


The invention described herein was made in the per- 
formance of work under a NASA Contract, and is sub- 
ject to the provisions of Public Law 96-517 (35 U.S.C. 
202) in which the Contractor has elected not to retain 
title. 


TECHNICAL FIELD 


5 


10 


The invention relates to methods and apparatus for 
near real-time stereo vision, and in particular to such a 
system for use in making possible the autonomous cross- 
country traversal of a robotic vehicle, the system com- 15 
prising two parallel cameras mounted on special plat- 
forms, specialized image processing hardware, and 
unique stereo matching algorithms. 


BACKGROUND ART 20 

Practical considerations dictate that robotic vehicles 
be operable in both teleoperated and semi-autonomous 
modes. In the teleoperated mode, stereo cameras on 
board the vehicle may provide three-dimensional scene 
information to human operators via stereographic dis- 25 
plays. In the semi-autonomous mode, three-dimensional 
information is also required for automatic obstacle 
avoidance and must be provided by onboard rangefind- 
ers. 

Automatic stereo triangulation, or “stereo vision,” is 30 
a very attractive approach to onboard rangefinding, in 
part because the necessary video hardware is already 
required for teleoperation and in part because stereo has 
a number of potential advantages over other rangefind- 
ing technologies. These advantages include the fact that 35 
stereo is passive, nonscanning, nonmechanical, and uses 
very little power. 

The practicality of stereo vision has been limited by 
the slow speed of existing systems and a lack of consen- 
sus on basic paradigms for approaching the stereo prob- 40 
lem. The present invention uses a synthesis of ap- 
proaches based on area correlation, random field mod- 
eling, and compact, commercial hardware to produce a 
stereo system that gives range images from 60 X 64 ste- 
reo pairs at rates of up to 2 seconds per frame. 45 

Previous stereo vision work has been grouped into 
categories according to which geometric model of the 
world was employed, which optimization (i.e. search) 
algorithms were employed for matchings and which 
constraints were imposed to enhance the reliability of 50 
the stereo matching process. Primary approaches to 
geometry have been to use either feature-based or field- 
based world models. 

Feature-based approaches typically extract two-di- 
mensional points or line segments from each image, 55 
match these, and output the parameters of the corre- 
sponding three-dimensional primitives. Field-based 
models consist of discrete raster representations, in par- 
ticular the “disparity field” specifying the stereo dispar- 
ity at each pixel in the image. r 60 

Field-based approaches typically perform matching 
by area correlation. A wide variety of search algorithms 
have been used, including dynamic programming, gra- 
dient descent, simulated annealing, and deterministic, 
iterative “local support” methods. 65 

Coarse-to-fine search techniques using image 
pyramids can be combined with most of these methods 
to greatly improve their efficiency. Finally, many 
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sources of search constraint have been used to reduce 
the likelihood of false matches, including multispectral 
images, surface smoothness models, and redundant im- 
ages, as in trinocular stereo or motion-based bootstrap 
strategies. 

The application of statistical modeling and estimation 
methods has been growing in both feature-based and 
field-based approaches. The use of surface smoothness 
models, which is known to be effective in practice, is 
fitting into the statistical framework through a relation- 
ship to prior probabilities in Bayesian estimation. The 
power of coarse-to-fine search, redundant images, and 
“active” or “exploratory” sensing methods is well 
known. 

A basic issue is the question of which type of feature- 
or field-based model might provide the most general 
approach to stereo vision. The roots of stereo vision lie 
in the use of area correlation for aerial triangulation. In 
the machine vision community of the 1970s and 1980s, 
correlation was believed by many to be too slow or to 
be inappropriate for other reasons, so methods based on 
edges or other types of features became popular. How- 
ever, feature-based methods also have limitations due to 
feature instability and the sparseness of estimated range 
images. The present invention shows that correlation 
methods can be fast, computationally inexpensive, and 
potentially useful in many contexts. 

Another important issue is which combination or 
combinations of search algorithms and constraints pro- 
vide the most efficient and reliable performance. Pow- 
erful global search algorithms such as simulated anneal- 
ing and three-dimensional dynamic programming may 
give accurate results, but they are very expensive com- 
putationally. Analogously, using multispectral or re- 
dundant images provides more information, but in- 
creases the hardware and computational cost of a sys- 
tem. It is likely that comparatively simple methods will 
lead to fast and usually reliable performance, as shown 
in the paper “Practical Real-Time Imaging Stereo 
Matcher,” by H. K. Nishihara, published in the Septem- 
ber/October 1984 issue of Optical Engineering, volume 
23, number 5. 

The question then arises whether there are inexpen- 
sive performance metrics that can be used to determine 
when matching is not reliable, and whether such met- 
rics can be used to control switching between simple, 
fast procedures when these work and more powerful, 
expensive procedures when scene characteristics make 
them necessary. 

U.S. Pat. No. 4,905,081 to Morton discloses a method 
and apparatus for transmitting and receiving three-di- 
mensional video pictures. Transmission of video pic- 
tures containing depth information is achieved by tak- 
ing video signals from two sources, showing different 
representations of the same scene and correlating them 
to determine a plurality of peak correlation values 
which correspond to vectors representing depth infor- 
mation. The first video signal is divided into elementary 
areas and each block is tested, pixel by pixel, with each 
vector to see which vector gives the best fit in deriving 
the second video signal from the first. The vectors 
which give the best fit are then assigned to their respec- 
tive areas of the picture and constitute difference infor- 
mation. The first video signal and the assigned vectors 
are then transmitted in parallel. The first video signal 
can be received as a monoscopic picture, or alterna- 
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tively the vectors can be use to modified the first signal 
to form a display containing depths. 

As mentioned in the patent to Morton, the method 
can be used as a remote sensing technique for use with 
robots in hazardous environments. Such robots often 5 
use stereoscopic television to relay a view of their sur- 
roundings to an operator, and the technique described 
could be used to derive and display the distance of an 
object from a robot to avoid the need for a separate 
rangefinder. For autonomous operation of the robot, 1° 
however, information concerning the distance to a haz- 
ardous object in the environment of the robot must be 
available in near real-time. 

The slow speed of prior-art stereo vision systems has 
posed a major hurdle in the performance of semi- 15 
autonomous robotic vehicles. Semi-autonomy in combi- 
nation with teleoperation is desired for many tasks in- 
volving remote or hazardous operations, such as plane- 
tary exploration, waste cleanup, and national security. 

A major need has been a computationally inexpensive 20 
method for computing range images in near real time by 
cross-correlating stereo images. 

STATEMENT OF THE INVENTION 

25 

Apparatus for a near real-time stereo vision system 
for use with a robotic vehicle comprises two cameras 
mounted on three-axis rotation platforms, image-proc- 
essing boards, and a CPU programmed with specialized 
stereo vision algorithms. Bandpass-filtered image 3Q 
pyramids are computed, stereo matching is performed 
by least-squares correlation, and confidence images are 
estimated by means of Bayes’ theorem. 

In particular, Laplacian image pyramids are built and 
disparity maps are produced from the 60 x 64 level of 35 
the pyramids at rates of up to 2 seconds per image pair. 

All vision processing is performed by the CPU board 
augmented with the image processing boards. 

In accordance with the present invention, near real- 
time stereo vision apparatus for use with a robotic vehi- 4^ 
cle comprises a first videocamera attached to mounting 
hardware for producing a first video output image re- 
sponsive to light from an object scene and a second 
videocamera also attached to the mounting hardware 
for producing a second video output image responsive 45 
to light from the object scene; a first digitizer for digitiz- 
ing the first video image having an input connected to 
an output of the first videocamera, and having an output 
at which digital representations of pixels in. the first 
video image appear, and a second digitizer for digitizing 50 
the second video image having an input connected to an 
output of the second videocamera, and having an out- 
put at which digital representations of pixels in the 
second video image appear; a videoprocessor for suc- 
cessively producing sequential stereo Laplacian 55 
pyramid images at left and right stereo outputs thereof 
from the digital representations of the first and second 
video images at first and second inputs connected to the 
outputs of the first and second digitizers; a stereo corre- 
lation means for correlating left and right stereo Lapla- 60 
cian pyramid images at the left and right stereo outputs 
of the videoprocessor, said stereo correlation means 
having an output and having first and second inputs 
connected to the left and right inputs of the video- 
processor; and a disparity map calculator connected to 65 
the output of the stereo correlation means for calculat- 
ing a disparity map of the object scene and having stor- 
age for storing an array of numerical values correspond- 
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ing to the stereo disparity at each pixel of a digital repre- 
sentation of the object scene. 

The novel features which are characteristic of the 
invention will be better understood from the following 
description in connection with the accompanying draw- 
ings. It should be appreciated, however, that each of the 
drawings is given for the purpose of illustration and 
description only and that the drawings are not intended 
to be a definition of the limits of the present invention. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of the near real-time stereo 
vision system of the present invention. / 

FIG. 2 is a block diagram illustrating the calculation 
of Laplacian image pyramids. 

FIG. 3 is a block diagram of the hardware implemen- 
tation of the stereo Laplacian image pyramids. 

FIG. 4 is a flow diagram of the stereo correlation 
algorithm. 

FIG. 5 is a search graph candidate array for a joint 
1-D stereo matching algorithm. 

FIG. 6 is the- search graph candidate array of FIG. 5 
with edges, illustrating the graph structure. 

DETAILED DESCRIPTION OF THE 
INVENTION 

The present invention formulates stereo as a statisti- 
cal estimation problem by using correlation-type opera- 
tors to estimate the disparity field and associated uncer- 
tainty models, and by using fast, simple optimization 
algorithms. Considerable background information of 
use in practicing the invention is contained in the paper 
“Stereo vision for Planetary rovers: Stochastic model- 
ing to near real-time implementation” by Larry Mat- 
thies, published in Proc. SPIE Conference 1570, Geo- 
metric Methods in Computer Vision, SPIE, July, 1991 
(also available as technical report JPL D-8131, January 
1991). That paper by Matthies is hereby incorporated 
herein by this reference. 

Referring to FIG. 1, near real-time stereo vision ap- 
paratus for use with a robotic vehicle comprises a first 
videocamera 12, attached to mounting hardware 13, for 
producing a first video output image 14 responsive to 
light from an object scene; a second videocamera 16, 
also attached to mounting hardware 13, for producing a 
second video output image 18 responsive to light from 
the object scene; a first digitizer 20 for digitizing the 
first video image 14, having an input connected to an 
output of the first videocamera 12, and having an output 
at which digital representations 22 of pixels in the first 
video image 14 appear, and a second digitizer 24 for 
digitizing the second video image 18, having an input 
connected to an output of the second videocamera 16, 
and having an output at which digital representations 26 
of pixels in the second video image 18 appear; a video- 
processor 28 for successively producing sequential ste- 
reo Laplacian pyramid images 30 and 32 at left and right 
stereo outputs thereof from the digital representations 
22 and 26 of the first and second video images 14 and 18, 
having first and second inputs connected to the outputs 
of the first and second digitizers 20 and 24; a stereo 
correlation calculator 34 for correlating left and right 
stereo Laplacian pyramid images 30 and 32 at the left 
and right stereo outputs of the videoprocessor 28, hav- 
ing an output and having first and second inputs con- 
nected to the left and right outputs of videoprocessor 
28; and a disparity map calculator 36 connected to the 
output of stereo correlation calculator 34 for calculating 
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a disparity map of the object scene, said disparity map 
calculator 36 having storage for storing an array of 
numerical values corresponding to the stereo disparity 
at each pixel of a digital representation of the object 
scene. 

The cameras and videoprocessing and computational 
hardware needed for the present invention are commer- 
cially available. For example, the videocameras can 
both be either Hitachi model KP-140 or Hitachi KP- 
180. The Hitachi HP- 180 is shuttered for 1 /600th of a 10 
second, which helps to minimize problems associated 
with camera movement. The mounting hardware 13 is 
of conventional design, including two three-axis rota- 
tion platforms, designed so that first and second video- 
cameras 12 and 16 can be adjustably mounted on a 
robotic vehicle or the like with parallel optical axes and 
appropriate roll about the optical axis. 

The following videoprocessing boards, available 
from Datacube, can be used for the videoprocessor 28 
to practice the invention: Digimax boards (two needed), 
a Framestore board, ROI Store 2048 boards (two 
needed), a VFIR Mk II board, and a Max-SP Mk I 
board. The stereo correlation calculator 34 is imple- 
mented with a central processing unit (CPU) pro- 
grammed to execute an algorithm that will effect the 
required computations. For the central processing unit 
a Heurikon CPU board, which is based on a Motorola 
68020 microprocessor, can be used. Alternatively, a 
CPU board based on the 68040 chip is available from 
the Radstone Company. 

Three of the most important practical issues arising in 
the implementation of the stereo vision system of the 
present invention concerned orienting the cameras to 
align scanlines for efficient matching, dealing with local 
bias and contrast differences between the two images, 
and optimizing the algorithms to obtain near real-time 
speed with modest computing hardware. 

If a plane is formed from the focal points of the two 
videocameras and a given object point, the intersections 
of that plane with the image planes of the videocameras 
are known as “epipolar lines.” A “scanline” is a row of 
pixels in the video image. Stereo matching is most effi- 
cient and easiest to implement when corresponding 
epipolar lines are also corresponding scanlines of the 
two images. Nominally, such alignment can be achieved 45 
by orienting videocameras 12 and 16 with parallel opti- 
cal axes and with appropriate roll about the optical axes. 

A quick and convenient procedure for achieving 
approximate alignment is obtained by mounting video- 
cameras 12 and 16 on three-axis rotation platforms 
mounting hardware 13 and using an alignment target as 
a guide for manual adjustment of the platforms. In the 
alignment procedure, both cameras view the target 
while Datacube image processing boards subtract and 
display the images at frame rates. The alignment target 55 
has horizontal stripes, plus vertical stripes with the same 
spacing as the stereo baseline. Thus, when the cameras 
are properly aligned the stripes will overlay each other 
perfectly on the subtracted image. The cameras can be 
aligned very quickly by adjusting the rotation platforms 
while watching the display. 

Since focus affects the size of the image, it is also 
necessary to adjust focus so that the images have the 
same size. Imperfect optics and compromises made 
during the alignment procedure may actually leave the 65 
cameras focused at different distances; however, this is 
less of a problem than different image sizes and can be 
accommodated easily as described below. 
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Any remaining misalignment is compensated by per- 
forming stereo matching with low-resolution 60x64 
images from the image pyramids. These low-resolution 
image's effectively have a “larger” pixel size than the 
5 original 480x512 image, making them less sensitive to 
epipolar misalignment. If wide-angle lenses are used, 
the images may also have significant curvature from 
fish-eye distortions. This has been avoided by using 
16-mm lenses, giving a field of view of about 30 degrees. 

Applying stereo algorithms in the correlation calcula- 
tor 34 to the original intensity image from the video- 
processor 28 is adequate if there is negligible difference 
in bias and contrast between the two images; however, 
if such differences do exist, they may severely distort 
15 the disparity images. Possible methods for coping with 
this problem include the use of normalized or “pseudo- 
normalized” correlation operators, calibrating the dif- 
ferences and modelling them within the correlator, and 
high-pass or bandpass filtering the images before stereo 
20 matching. The filtering approach, which decreases sen- 
sitivity to bias and contrast differences at the cost of 
potential increase in matching ambiguity, was chosen 
because it is least expensive to implement and has been 
found to perform well in practice, as implemented via 
25 Laplacian pyramids as described below. 

To be practical for use in semi-autonomous naviga- 
tion, stereo systems must operate in a few seconds per 
frame or less, using computing hardware that can be 
carried on a vehicle. A widely used algorithmic ap- 
30 proach to the speed problem is to perform stereo match- 
ing by coarse-to-fme search with image pyramids. Re- 
ducing image resolution by a factor of two reduces the 
required amount of search by a factor of eight, since 
half the disparity range is searched for one quarter of 
35 the pixels. Also, lowering resolution reduces sensitivity 
to misfocus and epipolar misalignment between the 
cameras. 

In the preferred embodiment of the invention an 
original 480x512 image is reduced to 60x64 by com- 
40 puting Laplacian image pyramids in the videoprocessor 
28; matching is done at this level without extending the 
search to higher resolutions. Using Laplacian, as op- 
posed to Gaussian, pyramids reduces bias and contrast 
problems as noted above. 

With processors in the 68020 class, efficient pyramid- 
based algorithms alone are not sufficient to reach real- 
time speeds. For increased speed, the present invention 
uses special-purpose convolution hardware to assist 
pyramid creation and cross-correlation, implemented in 
50 a VME card cage with a 68020-based CPU board and a 
set of image processing boards manufactured by Data- 
cube. The Datacube image processing hardware in- 
cludes two digitizers, an 8x8 by 8-bit convolver, an 
ALU board and image memory. 

Stereo image pyramids are computed at the rate of six 
image pairs per second, and all of the remaining stereo 
computation is performed by the 68020-based CPU. It is 
assumed that the epipolar alignment is adequate to 
allow the search for stereo matches to be strictly one- 
60 dimensional within corresponding scanlines. The sum- 
of-squared-differences (SSD) correlation is computed at 
pixel offsets by the efficient moving average technique, 
using windows ranging in size from 5 X 5 up to 11x11 
(7 X 7 is most common). 

The noisy nature of sensors implies that any mathe- 
matical model of the stereo vision problem will include 
a statistical component, which is useful in constraining 
the design of a stereo vision system. The fundamental 
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elements of an estimation problem include the variables 
to be measured, the measurements available, the mathe- 
matical model relating the measurements to the vari- 
ables of interest, the mathematical model of the uncer- 
tainties present, and the performance evaluation crite- 5 
rion to judge which estimation algorithms are “best.” 

In a field-based model, the variables are the stereo 
disparity at each pixel in the image. Assuming the im- 
ages are obtained with ideal epipolar geometry, the 
disparity is a scalar giving the horizontal displacement 10 
of matching pixels in the two images. For static stereo 
matching (i.e. no-motion model), the most basic mea- 
surements are the intensities themselves, and the sim- 
plest model relating these to the variables of interest ^ 
(disparity) is the equation 

I\{x,y)^I£x~d(x,y), y) +n (x,y), 

where Ii and I r denote the left and right images, d is the ^ 
disparity, and n models noise. The preferred embodi- 
ment actually uses as measurements the differences of 
intensity between the two images in small windows 
around the pixel. In reality, many other factors can 
come into the measurement model besides additive 2 5 
image noise, including geometric and photometric 
image distortions and aspects of surface reflectance 
models, but the invention uses pragmatic methods to 
compensate for these effects without explicitly model- 
ing them. 30 

The uncertainty model includes both a probabilistic 
model of the noise n and a probability distribution that 
models prior knowledge of the disparity field d(x,y). In 
stereo and surface interpolation, Markov random fields 
have been used to model surface smoothness constraints 35 
imposed on the estimated disparity field. Alternatively, 
the prior distribution may model prior information 
available about the disparity at each pixel, such informa- 
tion coming from processing previous images or from 
other sources. 40 

Prior probabilistic models models can be classified 
according to whether the disparity field is modelled as 
totally uncorrelated (0-D modelling), correlated in one 
dimension, say within scanlines (1-D), or correlated in 
both dimensions. The degree of correlation affects the 45 
difficulty of the estimation problem. The present inven- 
tion makes use of the uncorrelated and one-dimensional 
cases. In the uncorrelated case, posterior probabilities 
can be derived that provide valuable confidence esti- ^ 
mates and may be useful for performance evaluation 
purposes. 

For both the 0-D and joint 1-D algorithm, posterior 
probabilities are estimated as described further below, 
using table look-up to compute the exponential func- ^ 
tions for the Gaussian distributions. For efficiency, sub- 
pixel disparity estimates are obtained by fitting parabo- 
las to the three SSD values surrounding the disparity 
estimates obtained at pixel resolution. 

The entire system computes range images from the 
60x64 level of the pyramid in approximately two sec- 
onds per frame for the 0-D algorithm and three to four 
seconds for the 1-D algorithm. Much of the SSD calcu- 
lation could also be done by the Datacube boards, and 
it is estimated that this would bring the time per frame 65 
to well under one second for the 0-D algorithm and 
under three seconds for the 1-D algorithm, Additional 
speed-up could be obtained by upgrading the CPU. 
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CALCULATING THE LAPLACIAN IMAGE 
PYRAMIDE 

Referring to the flow diagram shown in FIG. 2, an 
input image Go in the form of a digitized array is re- 
ceived by the videoprocessor 28. There it undergoes a 
low-pass filtering operation to produce an image G\ 
which is then subtracted from Go to leave an image Lo 
which retains the high-frequency information of the 
original image. The bandpass image Lo is called a “La- 
placian” image which has the low frequencies removed. 

Next the blurred image Gi can be “subsampled,” as 
shown in FIG. 2, by taking every other pixel in every 
other line to reduce the total number of pixels in the 
image by a factor of four. The subsampled image G\ is 
then lowpass-filtered to produce a filtered image G2 
which is subtracted from Gi.to give a reduced Lapla- 
cian image L\. Similarly, G2 can be subsampled and 
filtered to produce an even smaller Laplacian image L2. 
Further recursion gives Laplacian images L3, L4, and so 
forth. 

In the preferred embodiment of the present invention 
two 480x512 images are used to generate Laplacian 
pyramids in stereo. L3 in the present case is thus an 
60 X 64 pixel image, and is used to reduce the computa- 
tional time required for the stereo vision system and to 
obviate various alignment and nonuniformity problems. 

U.S. Pat. No. 4,718,104 to Charles H. Anderson de- 
scribes a filter-subtract-decimate hierarchical pyramid 
signal analyzing and synthesizing technique similar to 
that used in the present invention. That patent is hereby 
incorporated herein by this reference. 

In terms of the inverse of the frame rate of the video 
cameras employed, namely 1/30 second, the following 
sequence of events occurs: during the first frame the 
stereo pair (Go )i e ft and (Go )right are acquired, during the 
second frame (Lo)/ e /rand (G\)i e /t are calculated, during 
the third frame (Lo)/e/>and (Gi)„^/are calculated, and 
during the fourth frame and subsequently the remaining 
little images of the pyramid are calculated. 

As shown in FIG. 3, the algorithm for calculating 
Laplacian image pyramids was implemented by Data- 
cube imaging processing VME boards: two A/D digi- 
tizer boards 38 and 40 for the stereo pair, a frame store 
board 44 for storing three frames, two region-of-interest 
eight-frame store boards 46 and 48 for doing the sub- 
sampling or “decimation,” a video finite impulse re- 
sponse (VFIR) board 50 capable of lowpass filtering at 
video rates, and a general purpose arithmetic and logic 
unit (ALU) board 52 for doing the necessary subtrac- 
tions. All lines interconnecting the blocks representing 
the image processing boards in FIG. 3 correspond to 
8-bit data lines running at 10 Mbytes/second. 

During frametime 1, digital output signals represent- 
ing the stereo image pair are sent from first and second 
digitizer boards 38 and 40 via data buses to first and 
second inputs 54 and 56 of first ROI board 46, where the 
images are stored. During frametime 2, image (Go)/f?/ris 
read out of first ROI board 46 through output 58 and 
fed into input 60 of VFIR board 50 where it is lowpass 
filtered to become (G])/ e //. A delayed version of (Go) left 
is sent on through output 62 to input 64 of ALU board 
52. (G\)ieft is simultaneously sent to input 66 of second 
ROI board 48 and input 68 of ALU board 52. In ALU 
board 52 (Gi)/ e //is subtracted from the suitably delayed 
version of (Go)ieft entering input 64. The difference 
(Lo )ief t leaves through output 70 and goes to input 56 of 
first ROI board 46. (Gi)/ e //is also sent through output 72 
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of VFIR board 50 and goes to input 66 of second ROI 
board 48 where it is sampled. During frametime 3, 
(Go)right is read out of first ROI board and through a 
sequence of events identical to that just described be- 
comes (G\)ri g ht and (Lo bright* 

In successive frametimes the signals {G n+ \ )} e j t and 
{L n )left are generated from ( G n )} e ft and the signals 
{G n +\)right and (Upright are generated from (G n )„^,in a 
manner identical to that described above for G\ and Lo- 

A listing in the computer programming language “C” 10 
for the pyramid calculating algorithm is given in Ap- 
pendix A. 

Referring to FIG. 4, the stereo correlation algorithm 
begins with the left and right Laplacian pyramid images 
30 and 32 produced by videoprocessor 28 operating on 
the digitized signals 22 and 26. The left image 30 under- 
goes a shift procedure 74 over a selected range of dis- 
parity values from d min to d m ax and the right image 32 is 
subtracted from the shifted images in an operation 76 to 2Q 
give difference images 78. The individual pixel values of 
a particular difference image 78 undergo squaring 80 
and summation 82 over a local window of predeter- 
mined width. This produces sum-of-squared-differences 
(SSD) images 84 for each value of shift d (from d m f„ to 25 
dmax)- See, for example, a paper by P. Anandan titled 
“Computing dense displacement fields with confidence 
measures in scenes containing occlusion/* published in 
Proc. of DARPA Image Understanding Workshop, pp. 
246-246, SAIC, 1984. For each pixel, the minimum 30 
SSD over all d is found in minimization procedure 86 
and simultaneously a posterior probability calculation 
88 is carried out. The results of 86 and 88 are used in 
selection 90 of the probability of the best d. The SSD 
minimization 86 over d is used in the outputting 92 of 35 
the disparity image, and the selection 90 of the probabil- 
ity of the best d is used in the outputting 94 of the confi- 
dence image. 

Alternatively, the functional block representing mini- 
mization 86 over d can be replaced with a block repre- 40 
senting dynamic program optimization 96. 
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assuming that disparity is constant over a small window 
around xo, the measurements are the intensity errors 

e(xo + £LKi: d)=Ir(xQ+&Xj-d)-Ii(xo + A*/) 

where Ax,- indexes pixels in the window. The measure- 
ments e(xo+ Ax,- d) are expressed together as the vector 

. £(*<y </) = [<K* 0 + A** d), ■ - ■ > eixo+^Xn; d)] T 

where n is the size of the window. To simplify the nota- 
tion in what follows, e will generally be written without 
the indices. Under the noise model described above, the 
joint probability density function of e is 

where or 2 =:cri 2 -f o> 2 is the sum of the noise variances in 
both images. A maximum likelihood disparity estimate 
maximizes Equation (1), which is equivalent to minimiz- 
ing the quadratic form e r e in the exponent, the familiar 
“sum of squared differences” or SSD matching crite- 
rion. 

For digital images, disparity is estimated in two steps. 
First, e r e is evaluated for every discrete d in a prede- 
fined search range to find the minimum to a pixel level 
of resolution. This yields an initial estimate do of d at 
pixel resolution. Estimates to subpixel resolution can be 
obtained in many ways. Though not the most computa- 
tionally efficient approach, linearization (taking the 
first-order expansion of e about d=do) is instructive; it 
yields 

eixo+Axa 

d) - 1 4(^0 + A XjXd - do) + n^XQ + A x- 
/-rf 0 )-«l(*0+A*/). 

Since the noise terms are modeled as stationary, the last 
two terms can be abbreviated to n(xo+Ax/), the vari- 
ance of which is cr 2 . Collecting all e(xo + Ax,-d), 
I'(xo-f A x/), and n(xo+Ax/) into the vectors e, J, and n 
yields 


MAXIMUM LIKELIHOOD MODEL 

The left (Ii) and right (I r ) images of a stereo pair are 
modelled as displaced versions of the same deterministic 45 
signal with noise added to each image. Thus, 

/l(x)==/(*)-j-ni(*) 

JAx) = I(x + d(x)) + n Ax) , n 


e~J(d—do) + n 

For implementation, the derivatives V are estimated 
from Ij. Since Ii is noisy, the derivative estimates will 
also be noisy; this can be moderated by smoothing the 
image before differentiation. 

With the linearized model of e, the conditional den- 
sity of e is 


where I is the underlying deterministic signal, d is the 
displacement or “disparity” between images Ii and I r , 
and m and n r model the noise in each image as station- 
ary, Gaussian white sequences with zero means and 55 
variances 0*1 2 and cr?-, respectively. 

Maximum likelihood estimation is appropriate when 
the disparity is treated as a deterministic, unknown 
parameter. Using this method requires defining a set of 
measurements as functions of the unknown disparity, 60 
formulating the probability density of the measurements 
conditioned on the disparity, and determining the dis- 
parity estimate that maximizes the probability of the 
measurements. As measurements, the intensity differ- 
ences between the two images in small windows around 65 
the pixels being matched are used, or more generally, 
the differences of a set of linear filters applied to each 
image. Thus, to estimate the disparity at pixel Ii(xo), 


Ae\d)aexp ^ - “J \e - J(d - rf 0 )l r [£ - £(d - do)} J 

Taking the log of this and setting the derivative with 
respect to d to zero, the disparity estimate is obtained 


d= d 0 


-f 


jTe 

fi 


This can be iterated to refine the estimate. In practice, 
iterating will require estimating the intensity errors e at 
positions between pixels. This can be done by fitting 
curves to the discrete intensity image. 

The uncertainty in the disparity estimate is expressed 
by the variance of the estimation error, E[d 2 ' 
]~E[d — d) 2 ]. Assuming d is unbiased (E[d] = d), stan- 
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dard error propagation techniques lead to the following 
estimate of the error variance: 



( 2 ) 

5 


This expression is actually a lower bound on the error 
variance. 

Note that the variance ov 2 relates the precision of the 
disparity estimate to the noise level cr 2 and the “edgi- 10 
ness” of the images, as expressed by the squared inten- 
sity derivatives J 7 }, as well as to the number of pixels in 
the window. The intensity derivatives needed to esti- 
mate (T (p - can be computed from l\ before attempting to 
match, so the variance estimate can be used as an “inter- 15 
est operator” to decide where matching should be at- 
tempted. In fact, c rd 2 is essentially the inverse of the 
so-called Moravec interest operator. Note that the use 
of linearization to derive ov 2 makes it a local measure of 
precision around the chosen optimum of e r e; therefore, 20 
it does not reflect the possibility of match ambiguity 
should there be multiple, nearly equivalent local op- 
tima. This possibility is reflected in posterior probability 
estimates derivable from Bayesian formulations of ste- 
reo, as discussed below. 25 

Next, although the estimated disparity field has been 
modeled as uncorrelated, disparity estimates for pixels 
separated by distances r^w, where w is the width of 
the matching window, actually will be correlated be- 
cause of the overlap in the windows used to compute 30 
the estimates. (The presence of correlated noise in the 
images would also induce correlation in the disparity 
estimates.) This point is relevant to the joint 1-D formu- 
lation discussed below. It is also relevant when analyz- 
ing the uncertainty of surface slope estimates obtained 35 
by subtracting range estimates from adjacent pixels. In 
this case, the correlation in the range estimates will tend 
to reduce the relative range error between the two 
pixels; consequently, it will also reduce the variance of 
the slope estimates. 40 

Finally, the zero-mean, stationary, white, Gaussian 
noise model used here is convenient analytically, but it 
has shortcomings in practice. In particular, real stereo 
image pairs tend to have local differences in mean inten- 
sity and in contrast. In the implementation described 45 
below, this is dealt with by bandpass filtering the images 
before matching. 


BAYESIAN ESTIMATION 

Bayesian estimation is appropriate when probabilistic 50 
prior information is available about the disparity field. 
Most often this information has taken the form of prior 
smoothness models for the disparity field in surface 
interpolation problems, where the smoothness model 
can be cast as a prior probability distribution. Another 55 
possibility that is less commonly noted but very power- 
ful is that there may be prior disparity estimates at each 
pixel. This can arise in “bootstrap” approaches to stereo 
where a narrow-baseline image pair is used to constrain 
matching of a wide-baseline pair. Even when no explicit 60 
smoothness models are used and the only prior disparity 
information is a single, global search interval, Bayesian 
models can be used to derive practical models of the 
posterior probability distribution for the disparity field, 
at least in the “0-D” case. Such distributions are useful 65 
in characterizing the uncertainty in disparity estimates. 
They may also provide a quantitative performance met- 
ric for comparing the effectiveness of alternate stereo 
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algorithms, such as the relative effectiveness of using 
color versus greyscale images. These issues for 0-D 
models and joint 1-D Bayesian models are considered 
below. 

In the following, the conditional probability density 
function f(e | d) of the measurements e, given disparity d, 
will continue to be used. This invention is concerned 
only with discrete disparities (i.e., pixel resolution), so 
probabilistic information about disparity will be mod- 
eled by discrete distributions with probability functions 
f(d). Thus, for disparity estimation at individual pixels, 
Bayes, theorem gives the posterior probability of the 
disparity d, given the measurements e, as in a Ph.D. 
dissertation by D. B. Gennery titled “Modeling the 
environment of an exploring vehicle by means of stereo 
vision,” Stanford University, June 1980: 


Ad |e) = 


At\d)Ad) 

lAeMWi) 


(3) 


where f(d) is the prior p.f. of d /, i indexes the range of 
possible disparities, and the conditional density f(e | d) is 
as in Equation (1). Optimal estimates of d by the maxi- 
mum posterior probability (MAP) criterion will now be 
defined. For a given set of measurements, the denomi- 
nator is a constant normalizing term that is not needed 
to obtain the MAP estimate, though it is needed to 
determine the actual posterior probabilities. It is as- 
sumed that any prior information about d comes from 
external sources, such as previous images, a laser scan- 
ner, or a map database, and is independent of the image 
noise. 

When the only disparity information available is a 
minimum and a maximum disparity limit, it is reasonable 
to model the prior distribution as uniform over this 
interval, as in the Ph.D. dissertation by D. B. Gennery 
referred to above and incorporated herein by reference. 
In this case, the prior distribution cancels in Equation 
(3) leaving only a constant normalizing term in the 
denominator: 


Ad\eJ = 


Ae\d) 

lifiejd) 


aAejd). 


(4) 


Thus, in this case, the MAP and the ML disparity esti- 
mates are the same and are obtained by minimizing the 
squared error criterion e r e. However, the Bayesian 
formulation also provides a posterior probability for 
each possible (discrete) disparity obtained simply by 
performing the normalization in Equation (4). This can 
be done efficiently by using table look-up to evaluate 
the exponential function in the Gaussian model for 
f(e|d) as done in the Ph.D. dissertation by D. B. Gen- 
nery. A posterior variance estimate can be obtained by 
computing the second moment of the posterior distribu- 
tion about its mean, as also done in the Ph.D. disserta- 
tion by D. B. Gennery, or it can be approximated via 
the linear error propagation used earlier. 

How can the posterior probability distribution be 
interpreted? First, the probability of the MAP estimate 
can be thresholded to determine whether there is suffi- 
cient confidence in the disparity estimate to accept it. 
This is done in the preferred embodiment of the inven- 
tion. Low probabilities for the MAP estimate can sig- 
nify that the posterior distribution has a broad, low 
peak, because the pixel is in a textureless area of the 
image, or that the posterior distribution has multiple 
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peaks due to matching ambiguity. In principle, these 
cases can be distinguished; for example, by also examin- 
ing the variance estimate/interest operator described 
earlier. In the case of ambiguity, other techniques could 
be invoked to resolve the ambiguity, such as surface 5 
smoothness criteria or exploratory camera motions In 
the present invention, the only action performed is ac- 
ceptance or rejection of the estimate based on a proba- 
bility threshold 

If a more informative distribution is available than the io 
uniform distribution, the required course of action de- 
pends on how the prior is modeled If the prior is tabu- 
lated as a discrete probability function, the numerator of 
Equation (3) for each possible disparity may be evalu- 
ated and the disparity estimate may be chosen as that d 15 
which maximizes the numerator. The posterior proba- 
bilities would be obtained as before by normalizing by 
the sum in the denominator. Rather than a discrete 
tabulation, the prior may be modeled as a continuous 
p.d.f. at each pixel. In particular, if the prior is modeled 20 
as a Gaussian density with mean d~ and variance s~, 
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objective functions for which it is very difficult to find 
the global optimum. 

A compromise approach models only one- 
dimensional interactions within each scanline. This is 
referred to as “joint 1-D” modeling, because joint esti- 
mators are derived for the vector d of disparity across 
an entire scanline A statistical approach is taken by 
deriving the estimators as extensions of the Bayesian 
model discussed above. Two models are developed for 
the prior probability distribution of d, one of which is a 
1 -D specialization of the membrane smoothness model, 
and one of which is a new model that assumes the exis- 
tence of prior disparity information at each pixel. The 
MAP criterion is again employed to define the optimal 
disparity vector d and show that, within each scanline, 
the globally optimal estimate can be found by dynamic 
programming. The resulting algorithms are relatively 
fast, perform very well, and can be applied indepen- 
dently and in parallel to each scanline. 

To derive the model, the joint version of Bayes’ theo- 
rem is written as 




25 


Ad\eJ = 




then the MAP estimate to pixel resolution may be ob- 
tained by maximizing the log-likelihood 

/ - , N (5) 

Kd) = \nfid\e) = - ±-( -L- fe + - to -A - L V K 


The measurement vector e now includes measurements 
for all pixels across the scanline. For illustration, we 
will first recast the results of the previous section in 
terms of the joint model. For simplicity, suppose that in 
matching each pixel intensity differences at only that 
pixel are used. Then for each pixel I(x/) the measured 
intensity error is 


where is a constant, or equivalently by minimizing the 
expression in braces. This expression is just a combina- 35 
tion of the intensity error term from before, weighted 
by the inverse noise variance, with a quadratic penalty 
for deviation from the prior estimate, weighted by the 
variance of the prior estimate. Intuitively, the local 
minima of (l/<x 2 )e r can be viewed as defining candidate 40 
disparities and the penalty term as influencing which 
candidate is considered optimal. 

Sub-pixel disparity estimates can be obtained by ear- 
lier. Furthermore, for nonuniform prior distributions, 
one can derive a posterior mean and variance that rep- 45 
resent the optimal combination of information from the 
prior distribution and measurements from the new im- 
ages. 

JOINT 1-D MODEL AND ALGORITHM 5Q 

The previous section modeled the disparity field as 
uncorrelated and then described algorithms for estimat- 
ing the disparity independently at each pixel. These 
algorithms are very simple and efficient, and they have 
been found to be very effective in indoor and outdoor 55 
operation. Indeed, this approach was motivated in part 
by arguments made by Nishihara in a paper titled “Prac- 
tical Real-Time Imaging Stereo Matcher,” published in 
the Sept ember /October 1984 issue of Optical Engineer- 
ing, volume 23, number 5, that simplicity is important in 60 
stereo vision systems. In contrast, more complex ap- 
proaches involve the use of explicit surface smoothness 
models and more global optimization methods. 

Smoothness models can improve reliability, as sur- 
face interpolation can be used to fill in gaps in the esti- 65 
mated disparity field. Usually two-dimensional surface 
models that impose constraints both within and across 
scanlines are employed. Unfortunately, this leads to 


e(x p'dj) — 1 t(*/— d i) 


where d / is the disparity at x/. The vector of measure- 
ments over the entire scanline is 

e(x;d)=[e(xj;d l ), . . . , e(x A7<y)] r 

where N is the length of the scanline Dropping the 
indices on e; its joint p.d.f. is 


Since the disparity field was modeled as uncorrelated, 
the joint prior distribution f(d) is just the product of the 
individual distributions for each pixel, 


Ad) = n 


which for the Gaussian case is 


Ad) - (27 r)-W2 


no-,- 


( 1 _ to - dr ) 2 \ 

)- 


(,2t)- s/2 I W d ~ | W2 exp^ -j-ld- d-] T W d ~\d -S-J \ 
with W d~ being the diagonal inverse covariance matrix 
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W rf 7=[diag(si- ..^SA-)]- 1 . 

The posterior distribution is 

5 

i 

The MAP estimate is obtained by taking the log of io 
this expression and finding the vector d that maximizes 
the result. Because the log-probability is a sum of terms 
like Equation (5), the optimal d is obtained by choosing 
the estimates for each pixel independently ^ 

One class of models in which the inverse prior covari- 
ance matrix Wrf- is not diagonal arises from regulariza- 
tion-based approaches to stereo vision and surface inter- 
polation. The original motivation for these approaches 
came from physical models of surface smoothness con- 
straints, wherein penalty terms for deviation from 
smoothness were modeled by the energy of deforming 
membranes or thin plates. Later, it was recognized that 
these physically-based models had equivalent formula- 25 
tions as Bayesian estimation problems, where the defor- 
mation energy was modeled by a prior probability dis- 
tribution. Specifically, when the issue of discontinuities 
is ignored, regularization-based penalty functions are 30 
equivalent to modeling the disparity field as a zero- 
mean, stationary random field. This equivalence for a 
1~D version of the membrane model regularizer will 
now be demonstrated, and it will be shown how the 3? 
MAP estimate for the resulting model can be obtained 
with dynamic programming. 

Following a paper by T Poggio, V Torre, and C. 
Koch titled “Computational vision and regulation the- 
ory,” published in Nature, 3 1 7(n):3 14— 3 19, September, 
1985, a formulation of stereo with a 1-D membrane 
model regularizer can be written as the problem of 
finding the function d(x) that minimizes 

45 

2 +Md'(x)) 2 }dx, (7) 

where X is a blending constant The term (d'(x)) 2 penal- 
izes departures of the estimated disparity field from zero 5Q 
slope; that is, it biases the algorithm to prefer surfaces 
that face the cameras directly. A discrete version of this 
formulation is obtained by using differences to approxi- 
mate d'(x) and writing 
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= 


l -1 0 
-1 2 -1 
0 -12 


0 0 
0 0 
0 0 


( 9 ) 


40 


55 


9(g) = f .f [«* - 4) - h l(*£>] 2 ) + X (rf,+ i - di? j, 

60 

so that a cost function is obtained by summing over 
each pixel in the scanline. With this cost function, the 
disparity vector d — [dj, . . . , d^^that minimizes the 
total squared intensity error across the scanline plus the ^ 
weighted, total “deviation from flatness” of d is sought. 
The second summation is equivalent to the quadratic 
form d^Wmd, with 


Thus, in 1-D the membrane model penalty term is 
equivalent to a quadratic form with a tridiagonal coeffi- 
cient matrix. 

To relate this to MAP estimation, one can work back 
from Equations (8) and (9) to probability distributions. 
The regularizing term is equivalent to the exponent of a 
prior distribution model of d as a degenerate Gaussian, 
with zero mean vector and singular inverse covariance 
matrix W m . Therefore, minimizing Equation (8) is 
equivalent to MAP estimation for the Gaussian model 
with d~=0, W^-=W m and f(e|d) as in Equation (6). 

From a practical standpoint, this example is particu- 
larly valuable in stereo vision because it is a case in 
which the MAP estimate for the entire scanline can be 
computed efficiently by a dynamic programming opti- 
mization algorithm. This can be seen by mapping Equa- 
tion (8) into a problem of finding a minimum-cost path 
in a weighted, directed graph Vertices of the graph are 
the possible disparities at each pixel This is shown in 
FIG. 5, which represents the scanline coordinate on the 
horizontal axis, the disparity coordinate on the vertical 
axis, and possible disparities for each pixel as the dots in 
each column. Edges of the graph join the vertices in 
adjacent columns, as in FIG. 6 Costs are assigned to the 
vertices from the intensity error term of Equation (8) 
and to edges from the squared derivative term. Thus, 
the value of the entire objective function for any d is 
obtained by adding the vertex and edge costs in the 
corresponding path through the graph, and an optimal d 
is one with minimum-cost path. 

For graphs with this structure, it is well-known that 
minimum-cost paths can be found by dynamic program- 
ming (DP). DP has been applied previously to feature- 
based approaches to stereo that seek to match extracted 
edges or intervals between edges, as in a paper by Y. 
Ohta and T Kanade titled “Stereo by intra- and inter- 
scanline search using dynamic programming,” pub- 
lished in IEEE Trans Pattern Analysis and Machine 
Intelligence, PAMI-7(2): 139-1 54, March 1985. The 
approach here differs by being based on a type of cross- 
correlation at every pixel and by deriving cost terms 
from the statistical model; nevertheless, once the graph 
and the weights are defined, the optimization algorithm 
is very similar. The data structure necessary for the 
algorithm consists of two fields for each vertex. One 
field stores the total cost of the optimal path up to this 
node; the other is a black pointer that indicates the 
immediately preceding vertex on this path (i.e., for 
column i, it points to a vertex in column i-1). The algo- 
rithm itself consists of a forward pass that computes the 
path costs and the back pointers, followed by a reverse 
pass that traces the back pointers to extract the optimal 
path from the data structure. See Table 1 below. The 
forward pass considers each column of the array in turn 
from left to right across the scanline Each column is 
processed with a nested loop that considers each vertex 
in that column and for each vertex in the current col- 
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umn considers the edges to all vertices in the previous 
column For each incoming edge, the cost of the optimal 
path traversing that edge is the sum of the optimal cost 
at the previous vertex and the costs for this edge and the 
current vertex. The minimum cost over all incoming 5 
edges is stored at the current vertex together with a 
pointer to the appropriate vertex of the previous col- 
umn. This procedure is repeated for all vertices in the 
current column and then for all columns across the 
scanline The reverse pass locates the end of the optimal io 
path by finding the minimum-cost vertex in the final 
column and then traces the back pointers from there. 
The correctness of this algorithm can be proven by 
induction on the path length The computational com- 
plexity of the algorithm is 0(NC 2 ), where C is the maxi- 15 
mum number of candidate vertices per column If all 
disparities in a fixed search range are considered as 
candidates, then C is the width of the search range. 
Table 1 below gives a pseudocode listing of the algo- 
rithm 20 

TABLE 1 

Forward pass: 

For each column i 
For candidate j 

Compute vertex cost 25 

For all candidates in column i-1 
Compute edge costs 
Compute total path cost as sum 
of previous total path 
cost, edge cost, and new 
vertex cost 

Choose the minimum total cost among 
incoming paths, assign it as the 
total cost to this vertex, and 
set a back pointer to the appro- 
priate vertex of the previous 
column. 

Reverse pass: 35 

Find the minimum total cost in Column N 
Trace the back pointers to find the optimal path. 


To summarize, joint 1-D estimators of the disparity 
vector d within each scanline have been derived by 40 
applying the MAP criterion to a Bayesian formulation 
of the 1-D stereo problem It has been shown that one 
such estimator is equivalent to combining the SSD 
cross-correlation measure with the 1-D version of the 
membrane model regularizing function, and that opti- 45 
mal estimates in this case can be obtained with dynamic 
programming applied independently to each scanline. 

In the derivation, it was assumed that intensity differ- 
ences were computed only for lXl windows around 
each pixel. The same approach applies to larger rectan- 50 
gular windows that overlap adjacent scanlines This 
algorithm has the convenience that the embedded sur- 
face model will interpolate over textureless regions of 
the image. Also, the likelihood of gross errors in dispar- 
ity can be reduced by heavily weighting the surface 55 
model term, though at the cost of over-smoothing the 
estimated disparity field In comparison to other stereo 
algorithms using surface smoothness models with 2-D 
interactions, with the present approach it is possible to 
find the globally optimal disparity estimates* (within 60 
each scanline) efficiently. This algorithm has been used 
very successfully in outdoor autonomous navigation. 

As indicated earlier, a related prior probability model 
can be developed for the case in which prior disparity 
information is available at each pixel. The independent 65 
model used for illustration at the start of this section was 
one example of such a model. In that case, the prior 
mean vector d - was not zero, but could specify a differ- 
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ent value at each pixel The inverse covariance matrix 
was diagonal. This can be generalized to include corre- 
lation in the prior distribution, which will lead to a 
nondiagonal inverse covariance matrix. Such correla- 
tion will exist if the prior disparity information was 
obtained by matching with previous images, as noted 
above. Such correlation can be modeled with the first- 
order Markov covariance function p M with 0<p<l. 
That is, the prior covariance between pixels x/ and xj 
would be given by the covariance function 

( 10 ) 

This covariance function has the inverse covariance 
matrix where C ” 1 is 


1 

O 

... 0 0 

-P 1 + P 2 ~P 

... 0 0 

0 -p 1 + p 2 

... 0 0 

0 0 0 

...-pi 



Since this is tridiagonal, this model would allow the 
same DP optimization algorithm to be used to estimate 
disparity as was described above for the regularization - 
based prior Equations (10) and (11) show this model as 
being stationary; however, it can be generalized to the 
nonstationary case with different variances at each 
pixel. 

To summarize, a joint 1-D disparity estimation algo- 
rithm based on a 1-D membrane surface smoothness 
model and dynamic programming optimization has 
been described. Results with this algorithm have been 
very good, indoors and outdoors. The algorithm is 
slower than the “0-D,” cross-correlation algorithm 
discussed earlier; moreover, because the computational 
complexity of DP is quadratic, the relative speed de- 
grades quickly as the disparity range increases. Refer- 
ring to FIG. 1, the invention also encompasses a method 
of near real-time stereo vision comprising the steps of 

( 1 ) producing a first video output image responsive to 
light from an object scene; 

( 2 ) producing a second video output image respon- 
sive to light from the object scene; 

(3) digitizing the first video signal to produce digital 
representations of the first video image; 

(4) digitizing the second video signal to produce digi- 
tal representations of the second video image; 

(5) videoprocessing the digital representations by 
successively producing sequential Laplacian stereo 
image pyramid pairs from the digital representa- 
tions of the first and second video images; and 

( 6 ) calculating a disparity map of the object scene and 
storing an array of numerical values corresponding 
to the disparity at each pixel of a digital representa- 
tion of the object scene. 

In a first variation of the method of the invention the 
calculating of the disparity map includes estimating 
disparity independently at each pixel, using a so-called 
0-D algorithm In a second variation of the method of 
the invention the calculating of the disparity map in- 
cludes estimating disparity jointly over a row of pixels 
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in a scanline, using a so-called 1-D algorithm. Using the 

0- D algorithm saves computational time, but using the 

1- D algorithm is perhaps more reliable. 

Referring to FIGS. 2 and 3, the videoprocessing of 
step (5) above operates on the digital representations of 5 
the first and second video images in the form of digi- 
tized image arrays (Go)/ e / r and (Go)right in the following 
sequence of steps: 

(a) low-pass filtering (Go)/ e / f to produce an image 

(Gi )iefu 10 

(b) subtracting (Gi)/ e / /t from (Go )iefi to produce an 
image (Lo )lefi which retains high-frequency infor- 
mation of (Go 

(c) subsampling (Gi )i e / h by taking every other pixel in 
every other line to remove the total number of 15 
pixels in the image by a factor of four and produce 

a subsampled 0 Gi )/<?//; 

(d) lowpass-filtering the subsampled (Gi )i e fi to pro- 
duce a filtered image (Gi )/ e /,; 

(e) subtracting from the subsampled (Gi)/ e / r to pro- 20 
duce an image (L\)i e /t which retains high-frequency 
information of (G\ )j e /t; 

(f) subsampling (G 2 )i e ft> by taking every other pixel in 
every other line to remove the total number of 
pixels in the image by a factor of four and produce 25 
a subsampled (G 2 )/<?//; 

(g) lowpass-filtering (G 2 )ieft to produce a filtered 
image (G 3 

(h) subtracting (G 3 )i e fi from (G 2 )j e ft to produce an 

image which retains high-frequency infor- 30 

mation of (G 2 )i e fu and 

(i) performing steps (a) through (h) as described 
above starting with (Go)ng^/- 

In terms of an inverse frame rate of the videocameras, 
the following sequence of events occurs during a first 35 
frametime the stereo pair (Go )ieft and (Go )right are ac- 
quired, during a second frametime the (Lo)/<?/r and 
(Gi)/ e /;are calculated, during a third frametime (Lo)right 
and (GOngA/are calculated, and during a fourth frame- 
time and subsequently the remaining Laplacian pyramid 40 
images are calculated. 

The method of the present invention can additionally 
comprise further recursion of steps (a) through (i) de- 
scribed above to give left and right Laplacian pyramid 
images L 3 , L 4 , and so forth as may be necessary or 45 
desired in a particular application. 

In the present invention, one further recursion is 
carried out for left and right Laplacian images L 3 of size 
60x64 pixels. 

Referring to FIG. 4 , in the above description of the 50 
method of the invention the calculating of the disparity 
map includes the stereo correlation of the left and right 
stereo Laplacian pyramid images produced as described 
from the digital representations of the first and second 
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video images. The stereo calculation is effected by the 
steps of 

shifting the left Laplacian pyramid image over a se- 
lected range of disparity values from d . m / R to d max to 
produce a series of shifted Laplacian pyramid images; 

subtracting the shifted Laplacian pyramid images 
from the right Laplacian pyramid image to give a differ- 
ence image consisting of a plurality of difference values 
corresponding to the plurality of individual pixels; 

squaring the difference values of a particular differ- 
ence image to produce a plurality of squared difference 
values; 

summing, over a local window of predetermined 
width, the plurality of squared difference values which 
produces a sum-of-squared-differences (SSD) image for 
each value of shift d from d m/ >, to d max to produce a 
plurality of such SSD images; 

finding the minimum over d for the SSD at each pixel 
and carrying out a posterior probability calculation; 

using the minimum over d for the SSD and the poste- 
rior probability calculation to select the probability of 
the best d; and 

using the minimum over d for the SSD to output a 
disparity image, consisting of a plurality of individual 
disparity values. 

In a variation of the method of the invention, the 
finding of the minimum over d is replaced by dynamic 
programming optimization, and the calculating of the 
disparity map includes estimating disparity jointly over 
a row of pixels in a scanline. 

The method described above may additionally com- 
prise the steps of: 

(i) calculating an output confidence image of the 
object scene, and 

(ii) storing an array of numerical values correspond- 
ing to the output confidence image. 

In particular, the calculating of an output confidence 
image of the object scene in step (i) above comprises 
using the minimum over d for the SSD and the posterior 
probability calculation to select the probability of the 
best d. 

The first autonomous cross-country robotic traverses 
(of up to 100 meters) have been achieved using the 
stereo vision system of the present invention with all 
computing done aboard the vehicle. The overall ap- 
proach disclosed herein provides a unifying paradigm 
for practical domain-independent stereo ranging. 

Those having skill in the arts relevant to the present 
invention will undoubtedly think of various obvious 
modifications or additions to the invention based upon 
the preferred embodiment disclosed herein. Therefore, 
it should be understood that the invention is not to be 
limited to the disclosed embodiment, but is to be limited 
only by the scope of the following claims. 
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linclude <vxWorks.h> 

^include "generics. h" 
linclude "taskLi'b.h" 

♦include "roi.h" 

♦define MaxLevel 6 
IMPORT roilmage *openRoiImage ( ) ; 

IMPORT roilmage *A0,*A1,*A2,*A3,*B0,*B1,*B2, *B3,*C0, 

*C1,*C2,*C3,*D0,*D1,*D2,*D3; 

IMPORT RS_DESC *roiLeft, *roiRight ; 

IMPORT int PCOUNTJ 

IMPORT SEM_ID PFLAG; 

IMPORT SEM_ID DQFLAG; 

EXPORT roilmage *leftGnSrc[ MaxLevel] ,*leftLn[ MaxLevel] , 

* lef tGnDest [ MaxLevel ] ; 

EXPORT roilmage * rightGnSrc[ MaxLevel ] , 

*rightLn[ MaxLevel] , 
*rightGnDest[MaxLevel] ; 

ini tLe ftPyr amid ( ) 

{ 

int level; 

XY_PAIR position, lastPosition, size, size2; 

size.x ** 512; 

size.y = 480; 

size2.x * size.x»l; 

size2.y = size.y»l; 

position.x =0; 

position. y =0; 

for (level=0;level<MaxLevel;level++) { 
if ( level==0 ) { 

leftGnSrc[ level] = openRoilmage (0, position, size ) ; 
leftLn[ level] = openRoilmage ( 1, position, size ) ; 






lef tGnDest [ level ] = openRoilmage ( 8 , position , size2 ) ; 

> 

lastPosition *» position; 
size.x /= 2; 
size.y /=2; 
size2.x /=2; 
size2.y /=2; 
position. x += size.x/2; 
position. y += size.y; 

> 


initRightPyramid { ) 

{ 

int level; 

XY_PAIR position, lastPosition, size, size2 ; 

size.x = 512; 

size.y ' = 480; 

size2.x = size.x»l; 

size2.y = size.y»l; 

position. x = 0; 

position. y = 0; 

lastPosition = position; 

position. x = 256; 

for (level=0;level<MaxLevel;level++) { 
if ( level==0 ) { 

rightGnSrc [ level ] = openRoilmage ( 4 , lastPosition, size ) ; 
rightLn [ level ] = openRoilmage ( 5 , lastPosition , size ) ; 

rightGnDest [ level ] = openRoilmage ( 8 , position , size2 ) ; 
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> 

else { 

rightGnSrc[ level] =openRoiImage( 8, lastPosition, size) ; 
rightLn [ level ] = openRoilmage ( 2 , lastPosition , size ) ; 

rightGnDest[ level ]= openRoilmage (8, position, size2) ; 

> 

lastPosition = position; 
size.x /=2; 
size.y /=2; 
size2.x /=2; 
size2.y /=2; 
position. y += size.y; 

> 

> 

initStereoPyramids ( ) 

{ 

initLeftPyramid( ) ; 
initRightPyramid ( ) ; 

> 

lef tFSDPyr ( times ) 
int times; 

{ 

int level; 
fill (A2 , 0 ) ; 
fill(C0,0) ; 
dgLap ( 0 ) ; 
displayl2 ( ) ; 

for (PCOUNT=times;PCOUNT 1=0 ;PC0UNT++) { 
grab2Show2 (AO , A2 ,C0 ) ; 
for (level=0;level<MaxLevel;level++) { 

fsdImage(leftGnSrc[ level] , lef tLn[ level] , 
leftGnDest[ level] ) ; 

> 

> 


} 
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rightFSDPyr ( times ) 
int times ; 

< 

int level; 
fill(A2,0) ; 
fill(C0,0) ; 
dgLap ( 0 ) ; 
displayl2 ( ) ; 

for ( PCOUNT=time s ; PCOUNT 1=0 ;PCOUNT++) { 
grab2Show2(A0,A2,C0) ; 
for (level*0;level<MaxLevel;level++) { 

f sdlmage ( rightGnSrc [ level ] , rightLn [ level ] , 
rightGnDest[ level] ) ; 

} 

} 

} 

void 

stereoFSD ( times ) 
int times; 

{ 

int level; 
seminit ( PFLAG ) ; 
seminit ( DQFLAG ) ; 
semGive (PFLAG) ; 
semGive ( DQFLAG ) ; 
fill(A2, 0,0, 0,0,0) ; 
fill(C0, 0,0,0, 0,0) ; 
dgLap ( 0 ) ; 
displayl2(); 

for (PCOUNT=times; PCOUNT 1=0 ;PC0UNT++) { 
grab2Show2(A0,A2,C0) ; 
for (level=0;level<MaxLevel;level++) { 

f sdlmage ( rightGnSrc [level] , rightLn [level] , 
rightGnDest[ level] ) ; 
f sdlmage (leftGnSrc[ level ] ,leftLn[ level] , 
leftGnDest[level] ) ; 

> 
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} 

r sRIdle ( roiRight , DQ_P4 OBUS ) ; 
rsRIdle ( roiLef t , DQ_P4 OBUS ) ; 

> 

stereoFSD2 ( times ) 
int times; 

% 

i 

int level; 
fill(A2, 0,0,0) ; 

£111(00,0,0,0) ; 

dgLap ( 0 ) ; 

displayl2 { ) ; 

grab2Show2(A0,A2,C0) ; 

for (level=0;level<MaxLevel;level++) •{ 

fsdImage(rightGnSrc[ level] ,rightLn[ level] , 
rightGnDest[ level] ) ; 

fsdImage(leftGnSrc[ level] , leftLn[ level] , 
leftGnDest [level] ) ; 

> 

for (;times 1=0 ;times — ) { 
grab2Show2 ( A0 , A2 , CO ) ; 
wireRoi7 ( DQ_ALTERN ) ; 
r sWZoom( roiRight, 1 ) ; 

for (level=0;level<MaxLevel;level++) { 

out56inL9RlO (rightGnSrc[ level] ,rightLn( level] , 
rightGnDest[ level] ) ; 

out56inL9RlO ( lef tGnSrc [ level ] , lef tLn [ level ] , 
leftGnDest [level] ) ; 

> 

wireRoi7 (DQ_PRIMARY) ; 
r sWZ oom ( roiRight , 0 ) ; 

> * 
rsRIdle( roiRight, DQ_P4 OBUS) ; 
rsRIdle( roiLef t,DQ_P4 OBUS) ; 

> 

stereoSub ( times ) 



int times; 

{ 
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dgLap ( 0 ) ; 
displayl2 ( ) ; 

for ( PCOUNI=times ; PCOUNT 1=0 ?PCOUNT++) { 
grab2Show2(A0,Al,Bl) ; 
subimage (AO, BO, Al ) ; 

> 

> 


We claim: 

20 

1. A near real-time stereo vision apparatus compris- 
ing: 

first videocamera means for producing a first video 
output image responsive to light from an object 
scene; 25 

second videocamera means for producing a second 
video output image responsive to light from said 
object scene; 

first digitizing means for digitizing said first video 
image, having an input connected to an output of 
said first videocamera means, and having an output 
at which digital representations of said first video 
image appear; 

second digitizing means for digitizing said second 
video image, having an input connected to an out- 35 
put of said second videocamera means, and having 
an output at which digital representations of said 
second video image appear; 
videoprocessing means for successively producing 
sequential left and right stereo Laplacian pyramid 4Q 
images from said digital representations of said first 
and second video signals, having first and second 
inputs connected to said outputs of said first and 
second digitizing means; and 
disparity map calculating means for calculating a 45 
disparity map of said object scene, operatively 
connected to said videoprocessing means and hav- 
ing storage means for storing an array of numerical 
values corresponding to the disparity at each pixel 
of a digital representation of said object scene; 50 
wherein said disparity map can be used to generate 
control signals for semi-autonomous operation of a 
robotic vehicle or the like. 

2. The near real-time stereo vision apparatus of claim 

1 wherein said disparity map calculating means includes 55 
stereo correlation means for correlating said left and 
right stereo Laplacian pyramid images. 

3. The near real-time stereo vision apparatus of claim 

2 additionally comprising an output confidence image 
means for calculating an output confidence image of 60 
said object scene, operatively connected to said stereo 
correlation means and having storage means for storing 

an array of numerical values corresponding to said 
output confidence image. 

4 . The near real-time stereo vision apparatus of claim 65 

3 wherein said stereo correlation means comprises: 
shifting means for shifting said left Laplacian pyramid 

image over a selected range of disparity values 


from d m in to d max to produce a series of shifted 
Laplacian pyramid images; 

subtracting means for subtracting said shifted Lapla- 
cian pyramid images from said right Laplacian 
pyramid image to give a difference image consist- 
ing of a plurality of difference values; 

sum-of-squared-differences (SSD) means for calculat- 
ing a sum-of-squared-differences (SSD) image for 
each value of shift d from d m j n to d max to produce 
a plurality of such SSD images, including squaring 
means for squaring said difference values of a par- 
ticular difference image to produce a plurality of 
squared difference values, and summing means for 
summing, over a local window of predetermined 
width, said plurality of squared difference values; 

minimum finding means for finding the minimum 
over d for said SSD; and 

disparity image output means for using said minimum 
over d for said SSD to output a disparity image 
consisting of a plurality of individual disparity 
values. 

5. The near real-time stereo vision apparatus of claim 

4 wherein said output confidence image means com- 
prises: 

posterior probability calculation means for carrying 
out a posterior probability calculation, operatively 
connected to said sum-of-squared-differences 
(SSD) means; 

selection means for using said minimum over d for 
said SSD and said posterior probability calculation 
to select the probability of the best d, operatively 
connected to said posterior probability calculation 
means and to said minimum finding means; and 

confidence image output means for using said best d 
to output a confidence image. 

6. The near real-time stereo vision apparatus of claim 

5 wherein said minimum finding means is replaced with 
dynamic programming optimization means for estimat- 
ing disparity jointly over a row of pixels in a scanline, 
operatively connected to said sum-of-squared-differ- 
ences (SSD) means. 

7. The near real-time stereo vision apparatus of claim 

6 wherein said first and second digitizing means and said 
videoprocessing means include image processing VME 
boards comprising: 

first and second A/D digitizer boards; 

a frame store board for storing three video frames, 
operatively connected to said first and second A/D 
digitizer boards; 



5 , 179,441 


33 

first and second region-of-interest (ROI) eight-frame 
store boards for doing subsampling, operatively 
connected to said first and second A/D digitizer 
boards and to said frame store board; 
a video finite impulse response (VFIR) board capable 
of lowpass filtering at video rates, operatively con- 
nected to said first and second region-of-interest 
(ROI) eight-frame store boards; and 
a general purpose arithmetic and logic (ALU) board 
for doing subtractions, operatively connected to 
said first and second region-of-interest (ROI) eight- 
frame store boards and to said video finite impulse 
response (VFIR) board. 

8 . The near real-time stereo vision apparatus of claim 
7 additionally comprising mounting means including 
two three-axis rotation platforms, for mounting said 
first and second videocameras so that said cameras can 
be adjustably mounted on a robotic vehicle or the like 
with parallel optical axes and appropriate roll about the 
optical axis. 

9. A method of near real-time stereo vision compris- 
ing: 

producing a first video output signal responsive to 
light from an object scene; 
producing a second video output image responsive to 
light from said object scene; 
digitizing said first video image to produce digital 
representations of said first video image; 
digitizing said second video image to produce digital 
representations of said second video image; 
videoprocessing said digital representations by suc- 
cessively producing sequential Laplacian stereo 
image pyramid pairs from said digital representa- 
tions of said first and second video images; and 
calculating a disparity map of said object scene and 
storing an array of numerical values corresponding 
to the disparity at each pixel of a digital representa- 
tion of said object scene. 

10. The method of claim 9 wherein said calculating of 
said disparity map includes the stereo correlation of said 
left and right stereo Laplacian pyramid images from 
said digital representations of said first and second video 
images, comprising: 

shifting said left Laplacian pyramid image over a 
selected range of disparity values from d m in to d max 
to produce a series of shifted Laplacian pyramid 
images; 

subtracting said shifted Laplacian pyramid images 
from said right Laplacian pyramid image to give a 
difference image consisting of a plurality of differ- 
ence values corresponding to individual pixel val- 
ues; 

squaring said difference values of a particular differ- 
ence image to produce a plurality of squared differ- 
ence values; 

summing, over a local window of predetermined 
width, said plurality of squared difference values; 
calculating a sum-of-squared-differences - (SSD) 
image for each value of shift d from d m / n to d max to 
produce a plurality of such SSD images; 
finding the minimum over d for said SSD and carry- 
ing out a posterior probability calculation; 
using said minimum over d for said SSD and said 
posterior probability calculation to select the prob- 
ability of the best d; and 
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using said minimum over d for said SSD to output a 
disparity image, consisting of a plurality of individ- 
ual disparity values. 

11. The method of claim 10 wherein said finding said 
minimum is replaced by dynamic programming opti- 
mizing, and wherein said calculating of said disparity 
map includes estimating disparity jointly over a row of. 
pixels in a scanline. 

12. ' The method of claim 9 additionally comprising: 

calculating an output confidence image of said object 

scene, and 

storing an array of numerical values corresponding to 
said output confidence image of said object scene, 
and 

storing an array of numerical values corresponding to 
said output confidence image. 

13. The method of claim 12 wherein said calculating 
an output confidence image of said object scene com-, 
prises: 

using said minimum over d for said SSD and said 
posterior probability calculation to select the prob- 
ability of the best d. 

14. The method of claim 9 wherein said videoprocess- 
ing operates on said digital representations of said first 
and second video images in the form of digitized image 
arrays (Go )ieft and (Go bright, comprising: 

(a) low-pass filtering said (Go)*/: to produce an image 
(G’i )left; 

(b) subtracting said (Gi)*//from (Go)*// to produce an 
image (Lo )*/ which retains high-frequency infor- 
mation of (Go)*/; 

(c) subsampling said (Gi )*//, by taking every other 
pixel in every other line to remove the total num- 
ber of pixels in the image by a factor of four and 
produce a subsampled (Gi )*/; 

(d) lowpass-filtering said subsampled (Gi )*/: to pro- 
duce a filtered image (G 2 )*/*; 

(e) subtracting said (G 2 )*/: from said subsampled 
(Gi)*/r to produce an image (Li)*/ r which retains 
high-frequency information of (Gi )*/; 

(f) subsampling said (G 2 )*/:, by taking every other 
pixel in every other line to remove the total num- 
ber of pixels in the image by a factor of four and 
produce a subsampled (G 2 )*/:; 

(g) lowpass-filtering said subsampled (G 2 )*/r to pro- 
duce a filtered image (G 3 )*/:; 

(h) subtracting said (G3)*// from subsampled (G 2 )*/: 
to produce an image (L 2 )*/: which retains high-fre- 
quency information of (G2)*/:; and 

(i) performing steps (a) through (h) as described 
above starting with (Go )right> 

15. The method of claim 14 wherein in terms of an 
inverse frame rate of said videocameras, the following 
sequence of events occurs; during a first frametime said 
stereo pair (Go)*/: and (Go )right are acquired, during a 
second frametime said (Lo)*/and (Gi)*/are calculated, 
during a third frametime said (Lq) right and (Gi)ngA: are 
calculated, and during a fourth frametime and subse- 
quently the remaining Laplacian pyramid images are 
calculated. 

16. The method of claim 14, additionally comprising 
further recursion of steps (a) through (i) to give left and 
right Laplacian pyramid imaged L 3 , L 4 , and so forth ad 
libitum. 


* * * * * 



